Photo by Jovana Askrabic on Unsplash
The goal of this assignment is to introduce you to R, RStudio, Git, and GitHub, which you’ll be using throughout the course both to learn the data science concepts discussed in the course and to analyze real data and come to informed conclusions.
This assignment assumes that you have reviewed the lectures titled “Meet the toolkit: Programming” and “Meet the toolkit: version control and collaboration”. If you haven’t yet done so, please pause and complete the following before continuing.
We’ve already thrown around a few new terms, so let’s define them before we proceed.
As the course progresses, you are encouraged to explore beyond what the assignments dictate; a willingness to experiment will make you a much better programmer! Before we get to that stage, however, you need to build some basic fluency in R. First, we will explore the fundamental building blocks of all of these tools.
Before you can get started with the analysis, you need to make sure you:
If you failed to confirm any of these, it means you have not yet completed the prerequisites for this assignment. Please go back to Prerequisites and complete them before continuing the assignment.
**IMPORTANT:** If there is no GitHub repo created for you for this assignment, it means I didn't have your GitHub username as of when I assigned the homework. Please let me know your GitHub username asap, and I can create your repo.
For each assignment in this course you will start with a GitHub repo that I created for you and that contains the starter documents you will build upon when working on your assignment. The first step is always to bring these files into RStudio so that you can edit them, run them, view your results, and interpret them. This action is called cloning.
Then you will work in RStudio on the data analysis, making commits along the way (snapshots of your changes) and finally push all your work back to GitHub.
The next few steps will walk you through getting the information about the repo to be cloned, cloning your repo in a new RStudio Cloud project, and getting started with the analysis.
On GitHub, click on the green Code button, select HTTPS (this might already be selected by default, and if it is, you’ll see the text Use Git or checkout with SVN using the web URL, as in the image on the right). Click on the clipboard icon 📋 to copy the repo URL.
Go to posit.cloud and then navigate to the course workspace via the left sidebar. It’s very important that you do this for two reasons:
Before you proceed, confirm that you are in the course workspace by checking out what’s on your top bar in RStudio Cloud.
In RStudio, click on the down arrow next to New Project and then choose New Project from Git Repository.
In the pop-up window, paste the URL you copied from GitHub, make sure the box for Add packages from the base project is checked (it should be, by default) and then click OK.
RStudio is comprised of four panes.
Try typing 2 + 2 in the Console and hit enter. What do you get?
If you type x <- 2 in the Console and hit enter, what do you get in the Environment pane? Importantly, this pane is also where the Git interface lives. We will be using that regularly throughout this assignment.
Before we introduce the data, let’s warm up with some simple exercises.
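As a small illustration of the two Console experiments mentioned above, the following sketch can be typed into the Console line by line. The object name result is introduced here only for illustration; x is the object named in the text.

```r
# Arithmetic typed at the Console is evaluated immediately
result <- 2 + 2
print(result)   # [1] 4

# Assignment creates an object; in RStudio it then appears
# in the Environment pane with its value
x <- 2

# exists() checks from code what the Environment pane shows visually
exists("x")     # [1] TRUE
```

Note that an assignment such as x <- 2 prints nothing in the Console; the only visible effect is the new entry in the Environment pane.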
The top portion of your R Markdown file (between the three dashed lines) is called **YAML**. It stands for "YAML Ain't Markup Language". It is a human-friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
Then go to the Git pane in RStudio.
You should see that your Rmd (R Markdown) file and its output, your md file (Markdown), are listed there as recently changed files.
Next, click on Diff. This will pop open a new window that shows you the difference between the last committed state of the document and its current state that includes your changes. If you’re happy with these changes, click on the checkboxes of all files in the list, and type “Update author name” in the Commit message box and hit Commit.
You don’t have to commit after every change; this would get quite cumbersome. You should consider committing states that are meaningful to you for inspection, comparison, or restoration. In the first few assignments we will tell you exactly when to commit and, in some cases, what commit message to use. As the semester progresses we will let you make these decisions.
Now that you have made an update and committed this change, it’s time to push these changes to the web! Or more specifically, to your repo on GitHub. Why? So that others can see your changes. And by others, we mean the course teaching team (your repos in this course are private to you and us, only). In order to push your changes to GitHub, click on Push.
This will open a dialog box where you first need to enter your username, and then your password. This might feel cumbersome. Bear with me… I will teach you how to save your password so you don’t have to enter it every time. But for this one assignment you’ll have to enter them manually each time you push in order to gain some experience with it.
Thought exercise: Which of the above steps (updating the YAML, committing, and pushing) needs to talk to GitHub? [1]
Updating the YAML is a local operation; it doesn’t require any communication with GitHub. Committing is also a local operation. Pushing does require communication with GitHub, as it is the step that sends your changes there.
R is an open-source language, and developers contribute functionality to R via packages. In this assignment we will use the following packages:
We use the library() function to load packages. In your
R Markdown document you should see an R chunk labelled
load-packages which has the necessary code for loading both
packages. You should also load these packages in your Console, which you
can do by sending the code to your Console by clicking on the
Run Current Chunk icon (green arrow pointing right
icon).
Note that these packages also get loaded in your R Markdown environment when you Knit your R Markdown document.
The city of Seattle,
WA has an open data portal that includes pets registered in the
city. For each registered pet, we have information on the pet’s name and
species. The data used in this exercise can be found in the
openintro package, and it’s called
seattlepets. Since the dataset is distributed with the
package, we don’t need to load it separately; it becomes available to us
when we load the package.
You can view the dataset as a spreadsheet using the
View() function. Note that you should not put this function
in your R Markdown document, but instead type it directly in the
Console, as it pops open a new window (and the concept of popping open a
window in a static document doesn’t really make sense…). When you run
this in the console, you’ll see the following data
viewer window pop up.
View(seattlepets)
You can find out more about the dataset by inspecting its
documentation, which contains a data dictionary listing the name
of each variable and its description. You can access it by running
?seattlepets in the Console or using the Help menu in
RStudio to search for seattlepets.
By running ?seattlepets in the Console or using the Help
menu in RStudio to search for seattlepets, you’ll see
documentation about the dataset pop up in the Help pane which contains
information such as the source of the data, a brief description, and a
data dictionary that describes each variable in the dataset.
According to the data dictionary, how many pets are included in this dataset?
There are a total of 52,519 pets in the dataset with 7 variables for each pet.
# Get the number of entries (rows) and variables (columns)
nrow(seattlepets)
## [1] 52519
ncol(seattlepets)
## [1] 7
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 1, knit the document, commit your changes with a commit message that says “Completed Exercise 1”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Again, according to the data dictionary, how many variables do we have for each pet?
There are 7 variables for each pet in the dataset:
license_issue_date: Date the animal was registered with Seattle
license_number: Unique identifier for each pet license
animal_name: Name of the pet
species: Species of the pet (e.g., Dog, Cat, etc.)
primary_breed: Primary breed of the pet
secondary_breed: Secondary breed of the pet (if applicable)
zip_code: Zip code animal is registered in
colnames(seattlepets)
## [1] "license_issue_date" "license_number" "animal_name"
## [4] "species" "primary_breed" "secondary_breed"
## [7] "zip_code"
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 2, knit the document, commit your changes with a commit message that says “Completed Exercise 2”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
What are the three most common pet names in Seattle?
The most common pet names are Lucy, Charlie, and Luna.
To do this you will need to count the frequencies of each pet name and display the results in descending order of frequency so that you can easily see the top three most popular names. The following code does exactly that.
The two lines of code can be read as "Start with the seattlepets data frame, and then count the animal_names, and display the results sorted in descending order." The "and then" in the previous sentence maps to %>%, the pipe operator, which takes what comes before it and plugs it in as the first argument of the function that comes after it.
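To make the "and then" reading concrete, here is a minimal sketch on a tiny made-up data frame. The name toy_pets and its values are hypothetical stand-ins for seattlepets, and the sketch assumes the dplyr package (part of the tidyverse) is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for seattlepets
toy_pets <- data.frame(
  animal_name = c("Lucy", "Max", "Lucy", "Luna", "Max", "Lucy")
)

# Piped version: "start with toy_pets, and then count animal_name"
piped <- toy_pets %>%
  count(animal_name, sort = TRUE)

# Equivalent nested call: the pipe simply passes toy_pets
# as the first argument of count()
nested <- count(toy_pets, animal_name, sort = TRUE)

piped                      # Lucy appears 3 times, Max twice, Luna once
all.equal(piped, nested)   # [1] TRUE
```

Both forms give the same result; the piped form tends to stay readable as more steps are added.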
The code below takes the seattlepets dataset from the
openintro package and produces a frequency table of pet
names. It first standardizes the animal_name variable by
converting all names to lowercase and trimming any leading or trailing
spaces. It then removes missing (NA) or blank entries to
ensure only valid names remain. Finally, it counts the occurrences of
each unique name, sorts them from most to least common, and labels the
frequency column as count for clarity.
popular_names <- seattlepets %>%
  mutate(animal_name = animal_name |>
           tolower() |>
           trimws()) %>% # remove leading/trailing spaces
  filter(!is.na(animal_name), animal_name != "") %>%
  count(animal_name, sort = TRUE, name = "count")
popular_names
## # A tibble: 13,775 × 2
## animal_name count
## <chr> <int>
## 1 lucy 440
## 2 charlie 387
## 3 luna 357
## 4 bella 331
## 5 max 273
## 6 daisy 261
## 7 molly 240
## 8 jack 232
## 9 lily 232
## 10 stella 227
## # ℹ 13,765 more rows
head(popular_names, 3)
## # A tibble: 3 × 2
## animal_name count
## <chr> <int>
## 1 lucy 440
## 2 charlie 387
## 3 luna 357
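To see why the standardization step described above matters, here is a small base R sketch on a made-up vector of names. The vector raw_names and its values are hypothetical and chosen only to show the effect; they are not taken from seattlepets.

```r
# Hypothetical raw entries: the same name typed three different ways,
# plus a blank and a missing value
raw_names <- c("Lucy", "lucy ", " LUCY", "Max", "", NA)

# Without cleaning, the three spellings of "lucy" are counted separately
table(raw_names)

# Standardize: lowercase, trim surrounding whitespace, drop NA and blanks
clean_names <- trimws(tolower(raw_names))
clean_names <- clean_names[!is.na(clean_names) & clean_names != ""]

# After cleaning, all three variants are counted together as "lucy"
sort(table(clean_names), decreasing = TRUE)
```

This is the same idea as the mutate() and filter() steps in the pipeline above, just written with base R functions on a plain character vector.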
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 3. In this exercise you will not only provide a written answer but also include some code and output. You should insert the code in the code chunk provided for you, knit the document to see the output, and then write your narrative for the answer based on the output of this function, and knit again to see your narrative, code, and output in the resulting document. Then, commit your changes with a commit message that says “Completed Exercise 3”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Let’s also look to see what the most common pet names are for various
species. For this we need to first group_by() the
species, and then do the same counting we did before.
Looks like many of those NAs were cats. Poor unnamed kitties…
seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE)
## # A tibble: 16,823 × 3
## # Groups: species [4]
## species animal_name n
## <chr> <chr> <int>
## 1 Cat <NA> 406
## 2 Dog Lucy 337
## 3 Dog Charlie 306
## 4 Dog Bella 249
## 5 Dog Luna 244
## 6 Dog Daisy 221
## 7 Dog Cooper 189
## 8 Dog Lola 187
## 9 Dog Max 186
## 10 Dog Molly 186
## # ℹ 16,813 more rows
But this output isn’t exactly what we wanted. We wanted to know the most common cat and dog names, but there are barely any cats present in this output! This is because there are more dogs than cats in the dataset overall. We can confirm this by counting the various species in the data.
6 pigs in the city? Ok… But we'll continue with cats and dogs.
seattlepets %>%
  count(species, sort = TRUE)
## # A tibble: 4 × 2
## species n
## <chr> <int>
## 1 Dog 35181
## 2 Cat 17294
## 3 Goat 38
## 4 Pig 6
Let’s search for the top 5 cat and dog names. To do this, we can use
the slice_max() function. The first argument in the
function is the variable we want to select the highest values of, which
is n. The second argument is the number of rows to select,
which is n = 5 for the top 5. It may be a bit confusing
that both of these are n, but this is because we already
have a variable called n in the data frame.
seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE) %>%
  slice_max(n, n = 5)
## # A tibble: 53 × 3
## # Groups: species [4]
## species animal_name n
## <chr> <chr> <int>
## 1 Cat <NA> 406
## 2 Cat Luna 111
## 3 Cat Lucy 102
## 4 Cat Lily 86
## 5 Cat Max 83
## 6 Dog Lucy 337
## 7 Dog Charlie 306
## 8 Dog Bella 249
## 9 Dog Luna 244
## 10 Dog Daisy 221
## # ℹ 43 more rows
This output provides the top 5 names for each species along with their respective counts. However, the results are sorted by count across all species, which means that the top names for one species may appear before those of another species.
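One detail worth noticing in the output above is that it has 53 rows rather than 4 × 5 = 20. A likely explanation is that slice_max() keeps tied rows by default (its with_ties argument defaults to TRUE), so a species in which many names share the same count can return more than five rows. The sketch below illustrates this on a small made-up table; toy_counts and its values are hypothetical.

```r
library(dplyr)

# Hypothetical counts: Cat names have distinct counts,
# while every Goat name appears exactly once (all tied)
toy_counts <- data.frame(
  species     = c("Cat", "Cat", "Cat", "Goat", "Goat", "Goat"),
  animal_name = c("luna", "lucy", "lily", "billy", "nanny", "pepper"),
  n           = c(5L, 3L, 1L, 1L, 1L, 1L)
)

# Default behaviour: ties are kept, so a group can return more than n rows
with_ties <- toy_counts %>%
  group_by(species) %>%
  slice_max(n, n = 2)

# with_ties = FALSE returns at most n rows per group
without_ties <- toy_counts %>%
  group_by(species) %>%
  slice_max(n, n = 2, with_ties = FALSE)

nrow(with_ties)      # [1] 5  (2 cats + 3 tied goats)
nrow(without_ties)   # [1] 4  (2 cats + 2 goats)
```

Which behaviour is preferable depends on the question: keeping ties avoids dropping equally popular names arbitrarily, while with_ties = FALSE guarantees a fixed number of rows per group.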
Based on the previous output we can easily identify the most common
cat and dog names in Seattle, but the output is sorted by n
(the frequencies) as opposed to being organized by the
species. Build on the pipeline to arrange the results so
that they’re arranged by species first, and then
n. This means you will need to add one more step to the
pipeline, and you have two options: arrange(species, n) or
arrange(n, species). You should try both and decide which
one organizes the output by species and then ranks the names in order of
frequency for each species.
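If it helps to see how the order of arguments in arrange() changes the result before trying it on the full data, here is a minimal sketch on four rows. The data frame toy_counts is hypothetical (its counts are borrowed from the outputs shown in this document), and desc() is used here to rank from most to least frequent, which goes one step beyond the two options named in the exercise.

```r
library(dplyr)

# Hypothetical mini table of name counts for two species
toy_counts <- data.frame(
  species     = c("Dog", "Cat", "Dog", "Cat"),
  animal_name = c("lucy", "luna", "charlie", "lucy"),
  n           = c(338L, 111L, 306L, 102L)
)

# Sort by species first; within each species, by descending count
by_species_then_n <- toy_counts %>%
  arrange(species, desc(n))

# Sort by descending count first; species only breaks ties in n
by_n_then_species <- toy_counts %>%
  arrange(desc(n), species)

by_species_then_n   # Cat rows first (111, 102), then Dog rows (338, 306)
by_n_then_species   # 338, 306, 111, 102 regardless of species
```

The first variable passed to arrange() is the primary sort key; later variables only matter among rows that tie on the earlier ones.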
popular_names <- seattlepets %>%
  mutate(animal_name = animal_name |>
           tolower() |>
           trimws()) %>%
  filter(!is.na(animal_name), animal_name != "") %>%
  count(species, animal_name, sort = TRUE, name = "count")
popular_names
## # A tibble: 16,669 × 3
## species animal_name count
## <chr> <chr> <int>
## 1 Dog lucy 338
## 2 Dog charlie 306
## 3 Dog bella 249
## 4 Dog luna 246
## 5 Dog daisy 221
## 6 Dog cooper 189
## 7 Dog max 189
## 8 Dog lola 188
## 9 Dog molly 186
## 10 Dog stella 185
## # ℹ 16,659 more rows
popular_names <- popular_names %>%
  group_by(species) %>%
  arrange(species, desc(count), animal_name)
popular_names
## # A tibble: 16,669 × 3
## # Groups: species [4]
## species animal_name count
## <chr> <chr> <int>
## 1 Cat luna 111
## 2 Cat lucy 102
## 3 Cat lily 86
## 4 Cat max 83
## 5 Cat bella 82
## 6 Cat charlie 81
## 7 Cat oliver 73
## 8 Cat jack 65
## 9 Cat sophie 59
## 10 Cat leo 54
## # ℹ 16,659 more rows
# Select the top 10 most common pet names for each species and display their counts
top_species_names <- popular_names %>%
  group_by(species) %>%
  slice_max(order_by = count, n = 10, with_ties = FALSE)
top_species_names
## # A tibble: 35 × 3
## # Groups: species [4]
## species animal_name count
## <chr> <chr> <int>
## 1 Cat luna 111
## 2 Cat lucy 102
## 3 Cat lily 86
## 4 Cat max 83
## 5 Cat bella 82
## 6 Cat charlie 81
## 7 Cat oliver 73
## 8 Cat jack 65
## 9 Cat sophie 59
## 10 Cat leo 54
## # ℹ 25 more rows
Here, we can see the top 10 names for each species along with their counts, organized by species and then by frequency within each species.
Looking at the data more closely:
Dogs: The most popular dog names are “lucy” (338), “charlie” (306), and “bella” (249), followed closely by “luna” (246) and “daisy” (221). These five names are the only ones with counts over 200; after them, popularity declines gradually, with the next five names all between 185 and 189.
Cats: The most popular cat names are “luna” (111), “lucy” (102), and “lily” (86). Cat names show a more even distribution in popularity compared to dog names, with a more gradual decrease in frequency from top to bottom.
Goats: All goat names appear just once in the dataset, suggesting there are few registered goats in Seattle, and their names are quite diverse with no clear pattern of popularity.
Pigs: Similar to goats, all pig names appear only once, indicating a small population of registered pigs with unique names.
It’s interesting to note that “lucy”, “luna”, and “bella” all appear in the top five names for both cats and dogs, showing some cross-species name preferences. The counts for dog names are also consistently higher than those for cat names, which reflects the fact that about twice as many dogs (35,181) as cats (17,294) are registered in Seattle.
Now that we have the top 10 names for each species, we can create a contingency table that displays the counts of these names across different species.
# Start from top_species_names (species, animal_name, count)
contingency_tbl <- top_species_names %>%
  pivot_wider(
    names_from = species,  # columns become species
    values_from = count,   # values are counts
    values_fill = NA       # leave missing combinations as NA
  )
contingency_tbl
## # A tibble: 30 × 5
## animal_name Cat Dog Goat Pig
## <chr> <int> <int> <int> <int>
## 1 luna 111 246 NA NA
## 2 lucy 102 338 NA NA
## 3 lily 86 NA NA NA
## 4 max 83 189 NA NA
## 5 bella 82 249 NA NA
## 6 charlie 81 306 NA NA
## 7 oliver 73 NA NA NA
## 8 jack 65 NA NA NA
## 9 sophie 59 NA NA NA
## 10 leo 54 NA NA NA
## # ℹ 20 more rows
# sort by Dog first, then Cat, Goat, and Pig, and finally by name
contingency_tbl <- contingency_tbl %>%
  arrange(desc(Dog), desc(Cat), desc(Goat), desc(Pig), animal_name)
contingency_tbl
## # A tibble: 30 × 5
## animal_name Cat Dog Goat Pig
## <chr> <int> <int> <int> <int>
## 1 lucy 102 338 NA NA
## 2 charlie 81 306 NA NA
## 3 bella 82 249 NA NA
## 4 luna 111 246 NA NA
## 5 daisy NA 221 NA NA
## 6 max 83 189 NA NA
## 7 cooper NA 189 NA NA
## 8 lola NA 188 NA NA
## 9 molly NA 186 NA NA
## 10 stella NA 185 NA NA
## # ℹ 20 more rows
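The choice of values_fill deserves a brief illustration. In the table above, an NA means that the name did not make the top 10 for that species, not that no animal of that species has the name, so filling with 0 could be misleading. The sketch below contrasts the two options on a made-up table; toy_top is hypothetical (its counts are borrowed from the outputs shown earlier), and the sketch assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical excerpt of per-species top names in long format
toy_top <- data.frame(
  species     = c("Cat", "Cat", "Dog", "Dog"),
  animal_name = c("luna", "lily", "luna", "daisy"),
  count       = c(111L, 86L, 246L, 221L)
)

# Default: combinations absent from the input become NA
wide_na <- pivot_wider(toy_top,
                       names_from = species,
                       values_from = count)

# Alternative: fill absent combinations with 0
wide_zero <- pivot_wider(toy_top,
                         names_from = species,
                         values_from = count,
                         values_fill = 0L)

wide_na     # lily has NA for Dog, daisy has NA for Cat
wide_zero   # the same cells now hold 0
```

Keeping NA, as the code above does, signals "not in this list" rather than "zero animals", which seems the safer reading when the input has already been cut down to the top names.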
When organizing the data by species first and then by
n, we can clearly see the most common names for each
species:
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 4. In this exercise you’re asked to complete the code provided for you. You should insert the code in the code chunk provided for you, knit the document to see the output, and then write your narrative for the answer based on the output of this function, and knit again to see your narrative, code, and output in the resulting document. Then, commit your changes with a commit message that says “Completed Exercise 4”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
The following visualization plots the proportion of dogs with a given name versus the proportion of cats with the same name. The 20 most common cat and dog names are displayed. The diagonal line on the plot is the \(x = y\) line; if a name appeared on this line, the name’s popularity would be exactly the same for dogs and cats.
The scatter plot above visualizes the relationship between the popularity of names among cats and dogs in Seattle. Each point represents a specific pet name, with its position determined by the proportion of cats (x-axis) and dogs (y-axis) that bear that name. The diagonal reference line indicates equal popularity between the two species.
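The proportions on the two axes can be thought of as each name's count divided by the total number of animals of that species. The following is a minimal sketch of that calculation, not the code used to produce the plot; toy_counts and all of its numbers are hypothetical.

```r
library(dplyr)

# Hypothetical counts; "other" stands for all remaining names
toy_counts <- data.frame(
  species     = c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog"),
  animal_name = c("luna", "lucy", "other", "luna", "lucy", "other"),
  n           = c(10L, 5L, 85L, 8L, 12L, 180L)
)

# Proportion of each name within its own species
name_props <- toy_counts %>%
  group_by(species) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

name_props
# luna: 0.10 of cats vs 0.04 of dogs -> relatively more popular among cats
# lucy: 0.05 of cats vs 0.06 of dogs -> relatively more popular among dogs
```

Using proportions rather than raw counts is what makes the two species comparable, since there are roughly twice as many registered dogs as cats.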
Overall Name Distribution
Diagonal Reference Line: The diagonal line represents equal popularity between cats and dogs. Names falling exactly on this line would have the same proportional representation in both populations.
Clustering Pattern: Most names cluster in the lower left portion of the plot (between 0.002 and 0.006 on both axes), indicating that pet name distributions have a “long tail”: a few very popular names and many less common ones.
Species Preferences
Statistical Insights
Correlation Analysis: The plot shows a positive correlation between cat and dog name preferences, suggesting that human naming tendencies transcend pet species. However, the correlation is moderate, not strong, indicating distinct species-specific preferences.
Outliers: “Lucy” and “Charlie” appear as statistical outliers in dog naming popularity, while “Luna” is an outlier for cats.
Proportion Range: Dog name proportions extend higher (up to 0.01) than cat names (maximum around 0.008), suggesting slightly more naming concentration in dogs.
Cultural Implications
This visualization reveals how pet naming conventions reflect human cultural preferences while also showing species-specific patterns. The differences may reflect perceptions of personality traits associated with each species or gender associations with certain names. Names like “Lucy” may be perceived as fitting dog personalities, while “Lily” may be seen as more suitable for cat temperaments.
🧶 ✅ ⬆️ Now is a good time to commit and push your changes to GitHub with an appropriate commit message. Commit and push all changed files so that your Git pane is cleared up afterwards. Make sure that your last push to the repo comes before the deadline. You should confirm that what you committed and pushed are indeed in your repo that we will see by visiting your repo on GitHub.
[1] Only pushing requires talking to GitHub; this is why you’re asked for your password at that point.